Graphing the Pandemic Era: Unraveling Opioid Treatment Policy Shifts Amidst COVID-19¶

Boyang Su (modified from a notebook written by J Montgomery Maxwell)¶

Goals for This Notebook¶

  • Gain a general understanding of the BRH data mesh.
  • Explore the BRH workspace.
  • Perform data manipulation and create engaging data visualizations using Python within the workspace.

Introduction to BRH¶

The Biomedical Research Hub (BRH) is a cloud-based data mesh that integrates multiple data commons into a single platform, where users can not only access clinical data but also analyze data in the workspace provided by BRH framework services.

Q: What is a data mesh?¶


In summary, data is the basic information; a database stores and organizes that data; a data commons facilitates sharing data openly; and a data mesh is a way of organizing data management in large organizations to promote decentralization and collaboration. Data commons and data meshes can be considered more advanced concepts and approaches that build upon the foundation of data and databases.

Q: What is BRH?¶

The Biomedical Research Hub (BRH) is a cloud-based data management model, where various data resources are hosted on the cloud and are built on a core set of cloud services. These resources are interconnected and can be accessed and analyzed through data portals, workspaces, and various applications. Each data resource has its own data model and cloud-based applications that access and work with the data. The system allows federated queries, meaning it can pull data from different resources and present it as if it were from a single source.

BRH currently holds a substantial amount of data, over 6 PB (petabytes), which is 6,000,000 GB, from more than 400,000 research participants. More importantly, the system is not just a data repository; it enables various applications and cloud-based workspaces to access and work with the data in the repository. BRH is one of the first federated systems that combines the three principles:

  1. utilizing a set of common cloud-based core software services (framework services)
  2. employing a common security framework (NIST SP 800-53r4) to ensure data security
  3. following a federation principle that allows multiple organizations to manage their own (possibly sensitive) data repositories while still providing users the ability to access data from two or more repositories in the federated system.

Overall, the BRH model offers a cloud-based distributed repository solution for managing biomedical research data while ensuring data security, interoperability, and independence of data resource management for different organizations (reference).

For more detailed information, please refer to the BRH documentation.

Q: What are the resources on BRH?¶

Below is a list of data commons available for access on BRH:

  • JCOIN Data Commons
    • The Justice Community Opioid Innovation Network (JCOIN), as part of the NIH HEAL Initiative, is a program aimed at improving access to evidence-based addiction treatment for individuals with opioid misuse and opioid use disorder (OUD) within the criminal justice system. It focuses on expanding effective treatment and care in collaboration with local and state justice systems and community-based treatment providers. For more information, visit the NIH JCOIN Website.
    • The JCOIN Data Commons (JDC) is a platform designed to support JCOIN's research goals. It facilitates cross-site data synchronization, harmonization, and reporting, along with providing analytic tools for data analysis. Moreover, JDC provides open access to a wide range of datasets, and further details can be found here.
  • NCI Cancer Research Data Commons
    • The Cancer Research Data Commons (CRDC) is an initiative by the National Cancer Institute (NCI) with the aim of empowering researchers to accelerate scientific discovery through data-driven approaches. It achieves this by connecting diverse datasets with analytical tools in the cloud. The CRDC enables researchers to combine diverse data types and conduct cross-domain analyses of large cancer datasets, leading to potential breakthroughs in cancer prevention, treatment, and diagnosis.
    • By utilizing the Cancer Data Aggregator and a common data model called CRDC-H, researchers can search and aggregate data from multiple repositories, including datasets like The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and The Clinical Proteomic Tumor Analysis Consortium (CPTAC). These data resources are made accessible through multiple platforms within the CRDC, including:
      • Genomic Data Commons (GDC): Focused on genomic data, GDC provides researchers with comprehensive access to genomic information from various cancer types.
      • Proteomic Data Commons (PDC): This platform specializes in proteomic data, enabling researchers to explore protein-related data for cancer research.
      • Integrated Canine Data Commons (ICDC): ICDC is dedicated to data related to comparative oncology, particularly data from canine cancer studies that can be relevant to human cancer research.
      • Imaging Data Commons (IDC): IDC offers access to imaging data, supporting the analysis of cancer images for research purposes.
      • Cancer Data Service (CDS): CDS provides data storage and sharing capabilities for NCI-funded studies that don't fit existing CRDC data commons' criteria, currently hosting genomic and imaging data, among others, from various NCI projects and independent research.
  • MIDRC Data Commons
    • The Medical Imaging & Data Resource Center (MIDRC) Data Commons facilitates the management, analysis, and sharing of medical imaging data.
    • The MIDRC Data Commons provides open access to various types of data, including imaging files, patient demographic data, COVID-19 test results, and other clinical information. For more information on the datasets, you can visit their website.
  • NIH Access Clinical Data
    • AccessClinicalData@NIAID is a secure, cloud-based data platform that facilitates sharing and access to reports and datasets from NIAID-sponsored clinical trials, including COVID-19 studies. This platform empowers the research community to generate new knowledge and advancements in understanding, treating, and preventing infectious diseases like COVID-19, ultimately contributing to improved health outcomes worldwide.
    • For more information on available data, please refer to the clinical trials. To request access to the data, researchers need to complete a Data Access Request (DAR) online form.
  • IBDGC Data Commons
    • The Inflammatory Bowel Disease Genetics Consortium (IBDGC) Data Commons assists in managing, analyzing, and sharing genetic data related to Inflammatory Bowel Disease (IBD).
    • IBDGC offers open access to researchers and the general public. More information on the data can be found here.
  • OADC
    • The Open Access Data Commons (OADC) serves as a platform for data management, analysis, and sharing, with the main goal of expediting the discovery and advancement of diagnostics, treatments, and preventive measures for diseases.
    • Researchers and the general public have open access to this valuable data, enabling them to advance scientific knowledge and improve healthcare outcomes. More information on the datasets in OADC is provided here.
  • CANINE Data Commons
    • The Canine Data Commons facilitates the organization and sharing of genomics data within the canine research community, with the primary objective of accelerating opportunities for advancements in the detection and treatment of canine cancer.
    • This is an open-access platform. More information about the datasets available within the Canine Data Commons can be found here.
  • BioDataCatalyst
    • The National Heart, Lung, and Blood Institute (NHLBI) BioData Catalyst (BDC) is a data commons aimed at aiding the research community in managing and sharing human disease data to advance our understanding of genetic complexities and expedite the development of treatments and diagnostics for diseases, including cancer.
    • Some parts of the data within BDC are open access, including the well-known Framingham Heart Study cohort. Additional information about the available data can be found here.
  • AnVIL
    • The AnVIL data commons supports the research community in data management, aiming to improve our comprehension of the genetic basis of complex traits and accelerate the discovery and development of treatments, diagnostic tests, and technologies for diseases.
    • A small portion of the data in AnVIL is accessible to the public. For further details about the available data, visit here.

Q: What is BRH Workspace?¶

The BRH Workspace is an advanced cloud-based environment designed to empower researchers and authorized users with seamless access to data within the BRH data ecosystem. It serves as a virtual laboratory, where data analysis and exploration are conducted securely and efficiently, unlocking a treasure trove of medical data for investigation.

Similar to a real laboratory, the BRH workspace offers a comprehensive suite of tools and resources for performing diverse data-related tasks, ranging from querying and visualization to robust statistical analysis. The true strength of this workspace lies in its ability to facilitate interactions with data from various repositories within the BRH. By doing so, users can derive valuable insights from diverse sources and perform cross-referenced analyses, leading to more comprehensive and illuminating discoveries.

One of its prominent features is its unwavering focus on data security and privacy. Access is granted solely to authorized individuals or groups vetted by the organizations managing the data resources. This stringent measure ensures that sensitive data is safeguarded and that all data usage complies strictly with pertinent regulations and policies.

At the heart of its design, the BRH Workspace fosters collaboration and interdisciplinary research. It opens doors for individuals with varying levels of expertise to gain valuable insights into the fascinating world of biomedicine. As a result, scientists and healthcare professionals can make informed decisions and drive the advancement of biomedical research, while even those with less professional background can participate and contribute to the ever-growing medical knowledge. This democratization of knowledge paves the way for new perspectives and potential career paths in healthcare and research.

In conclusion, the BRH Workspace stands as an exciting and engaging platform, revolutionizing biomedical research. Through its secure and collaborative nature, it unlocks opportunities for groundbreaking discoveries, benefiting both seasoned professionals and enthusiastic learners alike in their quest to unravel the mysteries of biomedicine.

BRH Setup¶

This notebook should be run in the Workspace on BRH. To use the Workspace, you need to login and request Workspace access here.

This notebook uses open access data from the Justice Community Opioid Intervention Network (JCOIN) Data Commons. Before running the code in the notebook on Workspace, you need to make sure you have authorized JCOIN Google Login through the BRH profile page.


After you have successfully authorized JCOIN Google Login, go to the BRH Workspace page.


The BRH workspace offers several Jupyter notebook options for doing analysis. Jupyter notebooks offer a versatile environment for interactive data analysis and modeling by executing code in cells with the flexibility to provide adjacent comments or context using Markdown.

There are 3 Jupyter notebook options for your working environment:

  • Generic: The generic Jupyter notebook image allows you to create a new notebook using either a Python or an R kernel. In this notebook, you can create your own Python or R code to import and analyze data.

  • Nextflow: Nextflow is an open-source computational workflow framework designed for bioinformatics and data analysis. The Nextflow notebook image enables researchers to create, execute, and share data analysis pipelines.

  • Tutorials: The tutorials Jupyter notebook image is preloaded with a number of tutorial notebooks that illustrate several approaches to exploring BRH data in the workspace. This JCOIN notebook is one of the tutorials.


We will launch the Tutorials notebook image and start a Jupyter notebook with a Python 3 kernel.

Now that we have launched the tutorial notebook workspace, let's explore the scientific question we will investigate today.

Introduction to the Question¶

Background¶

Opioid overdose has been a long-standing and difficult-to-solve problem in the United States. Throughout the COVID-19 pandemic, the crisis of Opioid Use Disorder (OUD) has severely escalated, resulting in over 100,000 drug overdose deaths in the US in the 12-month period from April 2020 to April 2021, a 28.5% increase compared to the previous year (reference). Medications for Opioid Use Disorder (MOUD), such as buprenorphine, methadone, and naltrexone, have been clinically proven to effectively reduce opioid use and mitigate negative outcomes (reference). However, despite the evidence of effectiveness, misinformation and stigma against medications for OUD among personnel in the justice and penal systems, as well as the general public, have prevented access to this successful treatment for most people with OUD (reference). Some examples of this include a mistaken belief among the general public and medical, judicial, and correctional professionals that there is no effective treatment for OUD, or that medication-based treatment is substituting one drug for another, or that medication for OUD can only be safely delivered in an in-patient environment.

To address the opioid crisis and try to mitigate these barriers to MOUD, state and local officials have begun to implement various strategies and enhance access to medication-assisted treatment for addiction.

Our objectives¶

In this analysis, we will visualize three pandemic-era changes in opioid treatment policy at the state level across the US. Using data from the Justice Community Opioid Innovation Network (JCOIN) Data Commons, we examine which states have approved the following policy changes, and when:

  1. Telehealth Treatment: This policy change permits telehealth OUD treatment with buprenorphine or methadone for patients already in a treatment program.
  2. SUD Medicaid Waiver: This type of policy change covers a diverse set of changes that expand the MOUDs covered by Medicaid, the settings in which they may be administered or prescribed, or the population of recipients eligible for MOUD. These changes reflect specifically whether the state has received an approved section 1115 Medicaid waiver addressing substance use disorder (SUD) treatment.
  3. Incarceration MOUD Treatment: This type of policy increases access to MOUD inside some state correctional facilities, at least during the COVID-19 pandemic.

Code¶

Import Libraries¶

The first step is to import the pandas python library and give it the alias pd.

A Python library is a collection of pre-written code, functions, classes, and modules that offer a set of specific functionalities and tools. The pandas library offers a powerful set of tools to manipulate, clean, and analyze data efficiently. For more information about pandas, please see its documentation.

When you import a library in Python, several actions take place, including initializing the library, loading its functions and classes into memory, and making them available for use in your code.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd

Import Data¶

Import a data file and codebook for each of the three policy changes we are visualizing (Telehealth Treatment, SUD Medicaid Waiver, Incarceration MOUD Treatment) using the Gen3 Python SDK. For more explanation of the drs-pull command, please refer to the documentation here.

The Gen3 Python SDK (Software Development Kit) is an open-source toolkit that provides users with tools and utilities to interact with the Gen3 platform programmatically, including functions and APIs related to managing data on the Gen3 platform.

The names after object, e.g. dg.6VTS/5200158e-e9fe-44ef-96c9-e89ecd402fc4, are the Globally Unique Identifiers (GUIDs) for the files.

In [2]:
!gen3 --commons_url jcoin.datacommons.io drs-pull object dg.6VTS/5200158e-e9fe-44ef-96c9-e89ecd402fc4
!gen3 --commons_url jcoin.datacommons.io drs-pull object dg.6VTS/2b83e419-8d3d-4569-b9a1-a52ecd387cba
!gen3 --commons_url jcoin.datacommons.io drs-pull object dg.6VTS/a0a8785a-8663-47b9-95ea-a1813612a2f1
!gen3 --commons_url jcoin.datacommons.io drs-pull object dg.6VTS/abe9cd49-fc86-4c9b-b9d0-f8c0280d8aaa
!gen3 --commons_url jcoin.datacommons.io drs-pull object dg.6VTS/b7974ffe-2e46-47cf-9d57-4d8900d7a40f
!gen3 --commons_url jcoin.datacommons.io drs-pull object dg.6VTS/dca15d95-aac5-4879-88cb-3a740398f26c
{"succeeded": ["dg.6VTS/5200158e-e9fe-44ef-96c9-e89ecd402fc4"], "failed": []}
{"succeeded": ["dg.6VTS/2b83e419-8d3d-4569-b9a1-a52ecd387cba"], "failed": []}
{"succeeded": ["dg.6VTS/a0a8785a-8663-47b9-95ea-a1813612a2f1"], "failed": []}
{"succeeded": ["dg.6VTS/abe9cd49-fc86-4c9b-b9d0-f8c0280d8aaa"], "failed": []}
{"succeeded": ["dg.6VTS/b7974ffe-2e46-47cf-9d57-4d8900d7a40f"], "failed": []}
{"succeeded": ["dg.6VTS/dca15d95-aac5-4879-88cb-3a740398f26c"], "failed": []}

Now, if you look at the directory shown in the left panel, you will see 6 new files in the folder: 3 Excel files containing the data and their corresponding codebooks in 3 PDF files.


Next, we use the pandas.read_excel function to read the 3 data files that we want to analyze into pandas DataFrames named df1, df2, and df3.

In [3]:
df1 = pd.read_excel('buprenorphine-and-methadone-during-covid-19-data-020222.xlsx')
df2 = pd.read_excel('covid-19-state-medicaid-waivers-data-020222.xlsx')
df3 = pd.read_excel('covid-19-moud-at-state-correctional-facilities-data-020222.xlsx')

Before we move on, let's first take a look at what the 3 DataFrames look like. The head(n) function returns the first n rows in the DataFrame.

Explore DataFrame1 (Telehealth Treatment)¶

In [4]:
df1.head(10)
Out[4]:
Jurisdictions Effective Date Valid Through Date Initial_inperson_bup Telehealth_existing_bup Telehealth_existing_methadone Medicaid_PA Medicaid_bup_PA Medicaid_methadone_PA Commercial_PA Commercial_bup_PA Commercial_methadone_PA Extended_dosage_bup Extended_dosage_methadone Extended_prescription_SAMHSAexempt Coprescribing_bup Coprescribing_methadone naloxone_nonspecific
0 Alabama 2021-05-03 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1
1 Alaska 2021-06-01 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1
2 Arizona 2021-06-01 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 1 1 1
3 Arkansas 2021-06-01 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1
4 California 2021-06-01 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1
5 Colorado 2021-06-01 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1
6 Connecticut 2021-05-18 2021-06-01 0 0 0 0 . . 0 . . 0 1 0 0 0 1
7 Delaware 2021-01-25 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 0
8 District of Columbia 2021-06-01 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1
9 Florida 2020-08-10 2021-06-01 0 0 0 0 . . 0 . . 0 0 0 0 0 1

Here we see there is a column named Jurisdictions that gives us the name of the state, and two columns named Telehealth_existing_bup and Telehealth_existing_methadone that tell us whether the state approved telehealth MOUD treatment with buprenorphine or methadone, respectively. A 0 means "not approved" and 1 means "approved". This will be the data visualized for the first policy change type, Telehealth Treatment. We can find more information for df1 in its documentation, "buprenorphine-and-methadone-during-covid-19-codebook-020222.pdf".
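As a quick sanity check of this 0/1 coding, here is a minimal sketch on toy data (not the real JCOIN file) showing how the two telehealth columns can be combined into a single approval flag, the same logic we apply later in the cleaning step:

```python
import pandas as pd

# Toy data mirroring the structure of df1 (column names taken from the codebook)
toy = pd.DataFrame({
    'Jurisdictions': ['Alabama', 'Illinois', 'New York'],
    'Telehealth_existing_bup': [0, 1, 1],
    'Telehealth_existing_methadone': [0, 0, 1],
})

# A state counts as "approved" if either telehealth column is 1
toy['Approved'] = ((toy['Telehealth_existing_bup']
                    + toy['Telehealth_existing_methadone']) != 0).astype(int)
print(toy['Approved'].tolist())  # → [0, 1, 1]
```

The sum of the two columns is zero only when neither medication was approved, so comparing the sum to zero yields the combined flag.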

Explore DataFrame2 (SUD Medicaid Waiver)¶

In [5]:
df2.head(10)
Out[5]:
Jurisdictions Effective Date Valid Through Date JM_Waiver_Yes, Section 1115 JM_WaiverYes, Section 1135 JM_WaiverYes, Section 1915(c) JM_1115 JM_15sud JM_15slim JM_1135 JM_25sud JM_35tele Med_SPA JM_35pa JM_1915 JM_19sud JM_19vir JM_19virexp Med_exp
0 Alabama 2021-03-12 2021-07-01 0 1 1 0 . . 1 0 0 1 0 1 0 1 0 0
1 Alaska 2021-03-25 2021-07-01 0 1 1 0 . . 1 0 1 0 1 1 0 1 0 1
2 Arizona 2021-06-24 2021-07-01 1 1 1 1 0 0 1 0 0 1 1 1 0 1 0 1
3 Arkansas 2021-05-26 2021-07-01 0 1 1 0 . . 1 0 0 1 1 1 0 1 0 1
4 California 2021-03-12 2021-07-01 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1
5 Colorado 2021-04-16 2021-07-01 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1
6 Connecticut 2021-03-12 2021-07-01 0 1 1 0 . . 1 0 1 1 1 1 0 1 0 1
7 Delaware 2021-03-15 2021-07-01 1 1 1 1 0 0 1 0 0 1 1 1 0 1 0 1
8 District of Columbia 2021-03-12 2021-07-01 0 1 1 0 . . 1 0 0 1 1 1 0 1 0 1
9 Florida 2021-03-16 2021-07-01 0 1 1 0 . . 1 0 0 1 1 1 0 1 0 0

Similarly, in this DataFrame, we also get a column called Jurisdictions for the states and another column JM_15sud with values 0, ., and 1 indicating whether the state received an approved section 1115 Medicaid waiver that explicitly addresses SUD treatment. Here, like above, 0 means "not approved" and 1 means "approved", while . indicates a missing value. This corresponds to question 2. For more information on this DataFrame df2, you can read the documentation "covid-19-state-medicaid-waivers-for-sud-codebook-020222.pdf".

Explore DataFrame3 (Incarceration MOUD Treatment)¶

In [6]:
df3.head(10)
Out[6]:
Jurisdictions Effective Date Valid Through Date JC_law JC_MOUDlaw JC_MOUD JC_MCont JC_release JC_visit JC_visitrest
0 Alabama 2020-06-09 2021-09-01 1 0 0 . 0 1 0
1 Alaska 2021-04-28 2021-09-01 1 0 0 . 0 1 2
2 Arizona 2021-06-19 2021-09-01 1 0 0 . 0 1 2
3 Arkansas 2021-07-16 2021-09-01 1 0 0 . 0 1 2
4 California 2021-04-10 2021-09-01 1 1 1 1 1 1 2
5 Colorado 2021-06-28 2021-09-01 1 1 1 1 1 1 2
6 Connecticut 2021-06-17 2021-09-01 1 1 1 0 0 1 2
7 Delaware 2021-03-16 2021-09-01 1 0 1 0 0 1 2
8 District of Columbia 2020-11-25 2021-09-01 1 0 0 . 1 1 2
9 Florida 2020-10-02 2021-09-01 1 0 0 . 0 1 2

For the last DataFrame, besides the column Jurisdictions for the states, we have the column JC_MOUDlaw with values 0 and 1 indicating whether the state correctional facilities approved modified policies explicitly related to MOUD treatment during COVID-19. This corresponds to question 3. The documentation for df3 is given in "covid-19-moud-at-state-correctional-facilities-codebook-021022.pdf".

Clean Data¶

Now, we have all data imported and ready. But in the DataFrames we just saw, there are many redundant columns that we don't need in our analysis. The next step is to clean the data for easier analysis.

We start with keeping only the useful columns in the DataFrames.

In [7]:
# Creates a new column called 'Policy Change' indicating whether the state approved telehealth MOUD treatment with either buprenorphine or methadone
# The sum of the values in columns 'Telehealth_existing_bup' and 'Telehealth_existing_methadone' is zero only when the state did not approve telehealth MOUD treatment with either buprenorphine or methadone
# The .astype(int) at the end converts a boolean value into an integer value
df1['Policy Change'] = ((df1['Telehealth_existing_bup'] + df1['Telehealth_existing_methadone']) != 0).astype(int)

# Create a new column called 'Policy' containing information about which policy is addressed in the row
df1['Policy'] = 'Telehealth MOUD Treatment'

# Reduce the DataFrame df1 to only containing the four columns 'Jurisdictions', 'Effective Date', 'Policy Change', 'Policy'
df1 = df1[['Jurisdictions','Effective Date', 'Policy Change', 'Policy']]

# Let's take a look at the new df1
df1.head()
Out[7]:
Jurisdictions Effective Date Policy Change Policy
0 Alabama 2021-05-03 0 Telehealth MOUD Treatment
1 Alaska 2021-06-01 0 Telehealth MOUD Treatment
2 Arizona 2021-06-01 0 Telehealth MOUD Treatment
3 Arkansas 2021-06-01 0 Telehealth MOUD Treatment
4 California 2021-06-01 0 Telehealth MOUD Treatment

Before we go on to df2, recall that the column JM_15sud contains the values 0, ., and 1, where . indicates a missing value. We will categorize the missing values as "not approved", since we want to unify all three DataFrames to use 0 for "not approved" and 1 for "approved". To achieve this, we define a dictionary.

A dictionary is a useful Python data structure, written with curly brackets containing keys and values in the format {key:value}. For more information, you can read the documentation.

In the line below, we define a dictionary called waiverMap. The keys for our dictionary are '.', 0, and 1, while the values are 0 and 1. Furthermore, we associate the keys to the values by defining a map:

  • '.' -> 0
  • 0 -> 0
  • 1 -> 1
In [8]:
waiverMap = {'.': 0, 0:0, 1:1}
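Applied to a toy Series containing the three possible values (a minimal sketch, not the real JM_15sud column), the mapping behaves like this:

```python
import pandas as pd

waiverMap = {'.': 0, 0: 0, 1: 1}

# Toy column mixing the three values found in JM_15sud
s = pd.Series([1, '.', 0, '.'])
print(s.map(waiverMap).tolist())  # → [1, 0, 0, 0]
```

Each element is looked up as a key in the dictionary, so every '.' becomes 0 while 0 and 1 pass through unchanged.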
In [9]:
# Create a new column 'Policy Change' indicating whether the state received an approved section 1115 Medicaid waiver that explicitly addresses SUD treatment
# The .map(waiverMap) converts each '.' into 0, leaving 0 and 1 unchanged
df2['Policy Change'] = df2['JM_15sud'].map(waiverMap)

# Create a new column called 'Policy' containing information about which policy is addressed in the row
df2['Policy'] = 'SUD Medicaid Waiver'

# Reduce the DataFrame df2 to only containing the four columns 'Jurisdictions', 'Effective Date', 'Policy Change', 'Policy'
df2 = df2[['Jurisdictions','Effective Date', 'Policy Change', 'Policy']]

# Let's take a look at the new df2
df2.head()
Out[9]:
Jurisdictions Effective Date Policy Change Policy
0 Alabama 2021-03-12 0 SUD Medicaid Waiver
1 Alaska 2021-03-25 0 SUD Medicaid Waiver
2 Arizona 2021-06-24 0 SUD Medicaid Waiver
3 Arkansas 2021-05-26 0 SUD Medicaid Waiver
4 California 2021-03-12 1 SUD Medicaid Waiver
In [10]:
# Create a new column 'Policy Change' indicating whether the state correctional facilities approved modified policies explicitly related to MOUD treatment during COVID-19
df3['Policy Change'] = df3['JC_MOUDlaw']

# Create a new column called 'Policy' containing information about which policy is addressed in the row
df3['Policy'] = 'MOUD Treatment While Incarcerated'

# Reduce the DataFrame df3 to only containing the four columns 'Jurisdictions', 'Effective Date', 'Policy Change', 'Policy'
df3 = df3[['Jurisdictions', 'Effective Date', 'Policy Change', 'Policy']]

# Let's take a look at the new df3
df3.head()
Out[10]:
Jurisdictions Effective Date Policy Change Policy
0 Alabama 2020-06-09 0 MOUD Treatment While Incarcerated
1 Alaska 2021-04-28 0 MOUD Treatment While Incarcerated
2 Arizona 2021-06-19 0 MOUD Treatment While Incarcerated
3 Arkansas 2021-07-16 0 MOUD Treatment While Incarcerated
4 California 2021-04-10 1 MOUD Treatment While Incarcerated

You might have noticed that we changed the three DataFrames df1, df2, and df3 to have the same columns. This allows us to concatenate the three DataFrames (stacking them vertically with aligned columns) into a new DataFrame df with a continuous index.

In [11]:
df = pd.concat([df1, df2, df3],ignore_index=True)
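The effect of ignore_index=True can be seen on two toy DataFrames (a minimal sketch, not the real data): without it, both rows would keep their original index 0; with it, the result gets a fresh continuous index.

```python
import pandas as pd

# Two toy single-row frames with the same columns
a = pd.DataFrame({'Jurisdictions': ['Alabama'], 'Policy': ['Telehealth MOUD Treatment']})
b = pd.DataFrame({'Jurisdictions': ['Alabama'], 'Policy': ['SUD Medicaid Waiver']})

stacked = pd.concat([a, b], ignore_index=True)
print(stacked.index.tolist())  # → [0, 1]
```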

Next, we define a dictionary for mapping the names of States and Territories to their respective abbreviations. The dictionary state_abbrev we define below contains the pairs of {state:state abbreviation} for all states and territories of the United States.

In [12]:
state_abbrev = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO", 
    "Connecticut": "CT", "Delaware": "DE", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", "Idaho": "ID",
    "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA",
    "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", "Minnesota": "MN", 
    "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", "Nevada": "NV",
    "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", "North Carolina": "NC", 
    "North Dakota": "ND", "Ohio": "OH",  "Oklahoma": "OK", "Oregon": "OR", "Pennsylvania": "PA", 
    "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", 
    "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", "West Virginia": "WV", 
    "Wisconsin": "WI", "Wyoming": "WY", "District of Columbia": "DC", "American Samoa": "AS", "Guam": "GU", 
    "Northern Mariana Islands": "MP", "Puerto Rico": "PR", "United States Minor Outlying Islands": "UM", 
    "U.S. Virgin Islands": "VI"}

Using the dictionary above, we may add a new column to df containing the abbreviations of the states in each row.

In [13]:
df['Abbreviations'] = df['Jurisdictions'].map(state_abbrev)

The next block of code uses a dictionary again to convert the integers 0 and 1 in the Policy Change column into the strings Not Approved and Approved respectively, stored in a new column Change In Policy.

In [14]:
policyMap = {0: 'Not Approved', 1:'Approved'}
df['Change In Policy'] = df['Policy Change'].map(policyMap)

Lastly, we reorder the columns for the final cleaned DataFrame df.

In [15]:
df = df[['Jurisdictions', 'Abbreviations', 'Effective Date', 'Change In Policy', 'Policy']]

# Print the DataFrame
df
Out[15]:
Jurisdictions Abbreviations Effective Date Change In Policy Policy
0 Alabama AL 2021-05-03 Not Approved Telehealth MOUD Treatment
1 Alaska AK 2021-06-01 Not Approved Telehealth MOUD Treatment
2 Arizona AZ 2021-06-01 Not Approved Telehealth MOUD Treatment
3 Arkansas AR 2021-06-01 Not Approved Telehealth MOUD Treatment
4 California CA 2021-06-01 Not Approved Telehealth MOUD Treatment
... ... ... ... ... ...
148 Virginia VA 2021-08-01 Approved MOUD Treatment While Incarcerated
149 Washington WA 2021-05-09 Not Approved MOUD Treatment While Incarcerated
150 West Virginia WV 2021-02-15 Not Approved MOUD Treatment While Incarcerated
151 Wisconsin WI 2021-07-06 Not Approved MOUD Treatment While Incarcerated
152 Wyoming WY 2020-03-18 Not Approved MOUD Treatment While Incarcerated

153 rows × 5 columns

Data Visualization¶

So far, we have completed all data cleaning and prepared our data for analysis.

Now, let's consider visualizing our findings. One effective way to do this is by creating a map of the United States, where each state is represented by a different color based on the approval status of various policies. This visualization will provide a clear and intuitive view of the data, allowing us to easily discern regional trends and differences in policy approval across the country.

The !pip install geopandas command is used in Jupyter Notebook or IPython environments to install the geopandas library.

In [16]:
!pip install geopandas
Requirement already satisfied: geopandas in /opt/conda/lib/python3.9/site-packages (1.0.1)
Requirement already satisfied: numpy>=1.22 in /opt/conda/lib/python3.9/site-packages (from geopandas) (2.0.2)
Requirement already satisfied: pyogrio>=0.7.2 in /opt/conda/lib/python3.9/site-packages (from geopandas) (0.10.0)
Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from geopandas) (24.1)
Requirement already satisfied: pandas>=1.4.0 in /opt/conda/lib/python3.9/site-packages (from geopandas) (2.2.2)
Requirement already satisfied: pyproj>=3.3.0 in /opt/conda/lib/python3.9/site-packages (from geopandas) (3.6.1)
Requirement already satisfied: shapely>=2.0.0 in /opt/conda/lib/python3.9/site-packages (from geopandas) (2.0.7)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.9/site-packages (from pandas>=1.4.0->geopandas) (2.9.0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.9/site-packages (from pandas>=1.4.0->geopandas) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.9/site-packages (from pandas>=1.4.0->geopandas) (2024.2)
Requirement already satisfied: certifi in /opt/conda/lib/python3.9/site-packages (from pyogrio>=0.7.2->geopandas) (2025.1.31)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas>=1.4.0->geopandas) (1.16.0)

Next, we import some libraries needed for our plots.

  • matplotlib.pyplot is a module provided by the matplotlib library in Python, which is a powerful plotting and visualization library. The pyplot module offers a collection of functions that provide a MATLAB-like interface for creating various types of plots, charts, and visualizations. Here is the documentation.

  • Geopandas is an open-source Python library that extends the capabilities of Pandas, a popular data manipulation library, to handle geospatial data. It provides a convenient and efficient way to work with geographical data, including points, lines, polygons, and other geometric shapes, along with their associated attributes. Here is the documentation.

In [17]:
import matplotlib.pyplot as plt
import geopandas as gpd

To plot the state map, we need to download the shapefile archive cb_2018_us_state_500k.zip [3.2 MB] from the United States Census Bureau website.

A shapefile is a standard file format used in Geographic Information Systems (GIS) to store geospatial data. It consists of multiple files, including the .shp file for geometric data, .dbf file for attribute data, .shx file as an index, and .prj file for coordinate reference system information. Shapefiles are widely used for mapping, visualization, and analysis of geographic features in various fields.

Let's create a folder called 'shapefile' in the file browser on the left-hand side and move everything from the zipped file into it.

[Screenshot: the extracted shapefile components inside the 'shapefile' folder in the workspace file browser]
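Once extracted, we can verify that all the component files described above are actually present. Here is a minimal stdlib sketch (the folder and file names match the ones used in this notebook, but the helper function itself is just for illustration):

```python
from pathlib import Path

# Roles of the main shapefile component files
ROLES = {
    '.shp': 'geometry',
    '.dbf': 'attribute data',
    '.shx': 'index',
    '.prj': 'coordinate reference system',
}

def check_shapefile(folder, stem):
    """Return a dict mapping each component extension to whether the file exists."""
    return {ext: (Path(folder) / f'{stem}{ext}').exists() for ext in ROLES}

# e.g. check_shapefile('shapefile', 'cb_2018_us_state_500k')
```

If any entry comes back False, geopandas may still load the .shp file but will be missing attributes, the spatial index, or projection information.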

Read the file cb_2018_us_state_500k.shp and create a GeoDataFrame named geo_usa. This GeoDataFrame contains the geometries (boundaries) and associated attributes of the U.S. states.

In [18]:
geo_usa = gpd.read_file('shapefile/cb_2018_us_state_500k.shp')
In [19]:
geo_usa.head()
Out[19]:
STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND AWATER geometry
0 28 01779790 0400000US28 28 MS Mississippi 00 121533519481 3926919758 MULTIPOLYGON (((-88.50297 30.21524, -88.49176 ...
1 37 01027616 0400000US37 37 NC North Carolina 00 125923656064 13466071395 MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
2 40 01102857 0400000US40 40 OK Oklahoma 00 177662925723 3374587997 POLYGON ((-103.00256 36.52659, -103.00219 36.6...
3 51 01779803 0400000US51 51 VA Virginia 00 102257717110 8528531774 MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
4 54 01779805 0400000US54 54 WV West Virginia 00 62266474513 489028543 POLYGON ((-82.6432 38.16909, -82.643 38.16956,...

Then we perform a merge operation between two DataFrames, df and geo_usa, that combines the data from both DataFrames based on the columns 'Abbreviations' in df and 'STUSPS' in geo_usa. The result, stored in the new DataFrame df_merge, contains the information from both DataFrames aligned on matching state abbreviations, effectively adding additional geographic data to the original df DataFrame.
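One detail worth knowing before running the real merge: pandas' merge defaults to an inner join, so any abbreviation that appears in only one of the two frames is silently dropped from the result. A toy illustration (with made-up frames, not the notebook's data):

```python
import pandas as pd

# Hypothetical mini-frames: 'ZZ' is an abbreviation with no geographic match
left = pd.DataFrame({'Abbreviations': ['AL', 'AK', 'ZZ'], 'value': [1, 2, 3]})
right = pd.DataFrame({'STUSPS': ['AL', 'AK'], 'area': [10, 20]})

# Inner join (the default): the 'ZZ' row has no match in 'STUSPS', so it is dropped
merged = left.merge(right, left_on='Abbreviations', right_on='STUSPS')
print(merged.shape[0])  # 2
```

If a state seems to vanish from the merged result, checking for typos in the abbreviation columns is a good first step.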

In [20]:
df_merge = df.merge(geo_usa, left_on='Abbreviations', right_on='STUSPS')
In [21]:
df_merge.head()
Out[21]:
Jurisdictions Abbreviations Effective Date Change In Policy Policy STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND AWATER geometry
0 Alabama AL 2021-05-03 Not Approved Telehealth MOUD Treatment 01 01779775 0400000US01 01 AL Alabama 00 131174048583 4593327154 MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ...
1 Alaska AK 2021-06-01 Not Approved Telehealth MOUD Treatment 02 01785533 0400000US02 02 AK Alaska 00 1478839695958 245481577452 MULTIPOLYGON (((179.48246 51.98283, 179.48656 ...
2 Arizona AZ 2021-06-01 Not Approved Telehealth MOUD Treatment 04 01779777 0400000US04 04 AZ Arizona 00 294198551143 1027337603 POLYGON ((-114.81629 32.50804, -114.81432 32.5...
3 Arkansas AR 2021-06-01 Not Approved Telehealth MOUD Treatment 05 00068085 0400000US05 05 AR Arkansas 00 134768872727 2962859592 POLYGON ((-94.61783 36.49941, -94.61765 36.499...
4 California CA 2021-06-01 Not Approved Telehealth MOUD Treatment 06 01779778 0400000US06 06 CA California 00 403503931312 20463871877 MULTIPOLYGON (((-118.60442 33.47855, -118.5987...

The last step before plotting is to convert the 'Change In Policy' column in df_merge to a categorical variable with two categories: 'Approved' and 'Not Approved'.

In [22]:
df_merge['Change In Policy'] = pd.Categorical(df_merge['Change In Policy'], categories = ['Approved','Not Approved'])
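A toy example (made-up values) of why we declare the categories explicitly: even when a subset of the data contains only one status, both declared categories are preserved, so legends and group counts stay consistent across all three maps:

```python
import pandas as pd

# Hypothetical data where only one status appears
s = pd.Series(['Approved', 'Approved'])
cat = pd.Categorical(s, categories=['Approved', 'Not Approved'])

# Both categories are known even though only one is observed
print(list(cat.categories))  # ['Approved', 'Not Approved']
print(pd.Series(cat).value_counts().to_dict())
```

Without the explicit declaration, a filtered subset with a single status would yield only one category, and the plots would no longer share a common legend.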

Single-Feature Choropleth Maps¶

Now, we generate a choropleth map that visualizes 'Change In Policy' for Telehealth MOUD Treatment across the states. Each state is colored according to its approval status using the 'Pastel1' colormap. Try changing the colors yourself via the cmap argument.

In [23]:
# Create a new GeoDataFrame 'gdf_TMT' by filtering 'df_merge' to include only rows where the 'Policy' column is 'Telehealth MOUD Treatment'.
gdf_TMT = gpd.GeoDataFrame(df_merge[df_merge['Policy'] == 'Telehealth MOUD Treatment'])

# Plot the GeoDataFrame 'gdf_TMT' on a map, using the 'Change In Policy' column to determine the color of each geographic region.
# Set 'categorical=True' to indicate that the values in 'Change In Policy' are categorical, and use 'cmap='Pastel1'' for the colormap.
# Show a legend to explain the colors based on the categorical values.
gdf_TMT.plot(column='Change In Policy', categorical=True, figsize=(10, 10), legend=True, cmap='Pastel1')

# Set the x-axis and y-axis limits of the plot to focus on a specific geographic region.
plt.xlim(-180, -60)
plt.ylim(20, 75)

# Set the title of the plot to 'Change in policy for Telehealth MOUD Treatment' with a font size of 15.
plt.title('Change in policy for Telehealth MOUD Treatment', fontsize=15)

# Display the plot.
plt.show()
[Choropleth map: Change in policy for Telehealth MOUD Treatment]

Now, we generate a choropleth map for 'Change In Policy' for SUD Medicaid Waiver across the states.

In [24]:
# Create a new GeoDataFrame 'gdf_SUD' by filtering 'df_merge' to include only rows where the 'Policy' column is 'SUD Medicaid Waiver'.
gdf_SUD = gpd.GeoDataFrame(df_merge[df_merge['Policy'] == 'SUD Medicaid Waiver'])

# Plot the GeoDataFrame 'gdf_SUD' on a map, using the 'Change In Policy' column to determine the color of each geographic region.
# Set 'categorical=True' to indicate that the values in 'Change In Policy' are categorical, and use 'cmap='Paired'' for the colormap.
# Show a legend to explain the colors based on the categorical values.
gdf_SUD.plot(column='Change In Policy', categorical=True, figsize=(10, 10), legend=True, cmap='Paired')

# Set the x-axis and y-axis limits of the plot to focus on a specific geographic region.
plt.xlim(-180, -60)
plt.ylim(20, 75)

# Set the title of the plot to 'Change in policy for SUD Medicaid Waiver' with a font size of 15.
plt.title('Change in policy for SUD Medicaid Waiver', fontsize=15)

# Display the plot.
plt.show()
[Choropleth map: Change in policy for SUD Medicaid Waiver]

Lastly, we generate a choropleth map to visualize the 'Change In Policy' for MOUD Treatment While Incarcerated.

In [25]:
# Create a new GeoDataFrame 'gdf_MTI' by filtering 'df_merge' to include only rows where the 'Policy' column is 'MOUD Treatment While Incarcerated'.
gdf_MTI = gpd.GeoDataFrame(df_merge[df_merge['Policy'] == 'MOUD Treatment While Incarcerated'])

# Plot the GeoDataFrame 'gdf_MTI' on a map, using the 'Change In Policy' column to determine the color of each geographic region.
# Set 'categorical=True' to indicate that the values in 'Change In Policy' are categorical, and use 'cmap='tab10'' for the colormap.
# Show a legend to explain the colors based on the categorical values.
gdf_MTI.plot(column='Change In Policy', categorical=True, figsize=(10, 10), legend=True, cmap='tab10')

# Set the x-axis and y-axis limits of the plot to focus on a specific geographic region.
plt.xlim(-180, -60)
plt.ylim(20, 75)

# Set the title of the plot to 'Change in policy for MOUD Treatment While Incarcerated' with a font size of 15.
plt.title('Change in policy for MOUD Treatment While Incarcerated', fontsize=15)

# Display the plot.
plt.show()
[Choropleth map: Change in policy for MOUD Treatment While Incarcerated]

Multi-Feature Choropleth Maps¶

Now we have generated three plots for the three different policies. Is there a way to integrate them so that we can compare the changes across policies more easily?

The answer is a multi-feature choropleth map!

Multi-feature choropleth maps visualize multiple variables on a single map, making comparison across features much easier. They offer a space-efficient and contextually rich representation, which makes them powerful tools for storytelling, interactive exploration, and comprehensive geospatial analysis.

To generate a multi-feature choropleth map, we will use the Plotly Express library, which provides a simple and intuitive interface for creating interactive visualizations. It is built on top of the Plotly graphing library and lets users generate a wide range of charts, including scatter plots, bar charts, line charts, choropleth maps, and more, with minimal code. You can find its documentation here.

In [26]:
import plotly.express as px

We can create choropleth maps using the plotly.express.choropleth function. Plotly comes with built-in geometries for U.S. states. To use them, set locationmode="USA-states" and provide locations as two-letter state abbreviations. To learn more about choropleth maps, you can visit the documentation.

We can show multiple features, or the same feature over a period of time, using the animation_frame argument. Here we create an interactive plot with animation_frame='Policy', which lets us toggle between the three public health policy changes we are investigating.

Try to run the code below yourself and see what you get!

In [27]:
fig = px.choropleth(df, 
                    # Set the width of the plot (in pixels)
                    width=800,  
                    # Set the height of the plot (in pixels)
                    height=600, 
                    # Specify the column in 'df' that contains the two-letter state abbreviations.
                    locations='Abbreviations',                     
                    # Set the location mode to "USA-states" to use the built-in geometries for U.S. states.
                    locationmode="USA-states",                     
                    # Specify the column in 'df' that contains the data to be used for coloring the map.
                    color='Change In Policy',                     
                    # Define the color map for the categorical values.
                    color_discrete_map={'Not Approved':'Gray', 'Approved':'Purple'},                    
                    # Set the column in 'df' that represents the animation frame (e.g., policy changes over time).
                    animation_frame='Policy',                     
                    # Set the scope of the map to 'usa' to focus on the United States.
                    scope='usa',              
                    # Set the title of the choropleth map.
                    title='Changes In Opioid Treatment Policy During COVID-19')

# Display the choropleth map.
fig.show()

In addition to plotting the data on a map, bar charts are a valuable tool for summarizing numerical information. They are particularly effective for discrete or grouped data, providing a clear and straightforward way to visualize and compare values.

We can analyze a specific policy by grouping states based on whether they approved it and then count the size of each group. As an example, let's visualize the data for the "Telehealth MOUD Treatment" policy using a bar chart.

In [28]:
# Group the data by 'Change In Policy' for the specific policy 'Telehealth MOUD Treatment' and calculate the size of each group
grouped_size = df[df['Policy'] == 'Telehealth MOUD Treatment'].groupby('Change In Policy').size()

# Plot the grouped_size data using a bar chart
grouped_size.plot(kind='bar')

# Set the title of the bar chart
plt.title('Change In Telehealth MOUD Treatment')

# Display the bar chart
plt.show()
[Bar chart: Change In Telehealth MOUD Treatment]

It would be beneficial to have the bar charts for the three different policies displayed side by side. This arrangement allows for a more intuitive and immediate understanding of the variations and trends across the policies.

In [29]:
# Filter the DataFrame to select rows where the 'Policy' column matches 'Telehealth MOUD Treatment',
# then group the data by 'Change In Policy' and calculate the size of each group
data1 = df[df['Policy'] == 'Telehealth MOUD Treatment'].groupby('Change In Policy').size()

# Filter the DataFrame to select rows where the 'Policy' column matches 'SUD Medicaid Waiver',
# then group the data by 'Change In Policy' and calculate the size of each group
data2 = df[df['Policy'] == 'SUD Medicaid Waiver'].groupby('Change In Policy').size()

# Filter the DataFrame to select rows where the 'Policy' column matches 'MOUD Treatment While Incarcerated',
# then group the data by 'Change In Policy' and calculate the size of each group
data3 = df[df['Policy'] == 'MOUD Treatment While Incarcerated'].groupby('Change In Policy').size()

# Define custom colors for each category
colors = ['green', 'red']

# Set the number of US states (50) as a common y-axis limit
max_value = 50

# Create a figure with three subplots arranged side by side in a row.
# sharey=True ensures the y-axis scale is consistent across all three subplots.
# 'fig' is the entire figure; 'axs' is a NumPy array of three axes objects, one per subplot.
fig, axs = plt.subplots(1, 3, figsize=(12, 5), sharey=True)

# Plot the first bar chart in the first subplot
axs[0].bar(data1.index, data1.values, color=colors)
axs[0].set_title('Telehealth MOUD Treatment')
axs[0].set_ylim(0, max_value)  # Set common y-axis limit

# Plot the second bar chart in the second subplot
axs[1].bar(data2.index, data2.values, color=colors)
axs[1].set_title('SUD Medicaid Waiver')
axs[1].set_ylim(0, max_value)  # Set common y-axis limit

# Plot the third bar chart in the third subplot
axs[2].bar(data3.index, data3.values, color=colors)
axs[2].set_title('MOUD Treatment While Incarcerated')
axs[2].set_ylim(0, max_value)  # Set common y-axis limit

# Adjust layout for better spacing
plt.tight_layout()

# Show the combined plot
plt.show()
[Bar charts: approval counts for the three policies, side by side]

Next, we can also make use of the column Effective Date. We can generate a multi-feature choropleth map with a slider representing different dates. As you move the slider, the map will display the number of policies approved by each state up to that specific date, providing an insightful visual representation of policy changes over time.

To accomplish this, we want to create a new DataFrame that will hold data about the number of policies approved in each state at specific points in time.

In [30]:
# Define the start and end dates for the time period
start_date = '2020-01-01'
end_date = '2021-12-01'

# Generate a list of timestamps at 3-month intervals within the specified range
timestamps = pd.date_range(start=start_date, end=end_date, freq='3M')
timestamps_list = timestamps.to_list()

# Define the columns for the new DataFrame
columns = ['Time', 'Abbreviations', 'Number of approved policies']

# Create an empty DataFrame to store the data
df_plot = pd.DataFrame(columns=columns)

# Loop through each timestamp in the list
for time in timestamps_list:
    # Filter the original DataFrame to include only rows with 'Effective Date' less than or equal to the current timestamp
    # and where the policy is approved ('Change In Policy' == 'Approved')
    df_prev = df[(df['Effective Date'] <= time) & (df['Change In Policy'] == 'Approved')]
    
    # Create a list of state abbreviations
    state_abbrev_list = list(state_abbrev.values())
    
    # Loop through each state abbreviation
    for state in state_abbrev_list:
        # Count the number of approved policies for the current state and timestamp
        number_of_approved_policies = df_prev[df_prev['Abbreviations'] == state].shape[0]
        
        # Create a new row of data containing the timestamp, state abbreviation, and the number of approved policies
        new_row_data = [time, state, number_of_approved_policies]
        
        # Add the new row to the df_plot DataFrame
        df_plot.loc[len(df_plot)] = new_row_data
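
The nested loop above is easy to follow but rebuilds the counts from scratch at every timestamp. An alternative, sketched here on a made-up mini-frame (the full df lives in the notebook), is to bucket approvals into periods with a crosstab and take a cumulative sum across time:

```python
import pandas as pd

# Hypothetical mini-frame standing in for df
toy = pd.DataFrame({
    'Abbreviations': ['MD', 'MD', 'CA'],
    'Effective Date': pd.to_datetime(['2020-03-01', '2020-09-01', '2020-06-01']),
    'Change In Policy': ['Approved', 'Approved', 'Approved'],
})

approved = toy[toy['Change In Policy'] == 'Approved']

# Bucket approvals into quarters, count per state, then accumulate across columns:
# each cell is the number of policies approved by that state up to that quarter
quarters = approved['Effective Date'].dt.to_period('Q')
counts = pd.crosstab(approved['Abbreviations'], quarters).cumsum(axis=1)
print(counts)
```

Note this sketch uses calendar quarters rather than the explicit 3-month timestamps above, and states with no approvals at all would need to be reindexed in; the explicit loop handles both directly, which is why the notebook keeps it.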

Now, the DataFrame df_plot will contain all the information we need, and we can use this data to create the animated choropleth map.

In [31]:
# Convert the 'Time' column to string format so each animation frame is labeled cleanly
df_plot['Time'] = df_plot['Time'].astype(str)

# Create the animated choropleth map using Plotly Express
fig = px.choropleth(df_plot,
                    # Set the width of the plot (in pixels)
                    width=800,  
                    # Set the height of the plot (in pixels)
                    height=600, 
                    # Specify the column in 'df_plot' that contains the state abbreviations
                    locations='Abbreviations',
                    # Set the location mode to "USA-states" to use the built-in geometries for U.S. states
                    locationmode='USA-states',
                    # Specify the column in 'df_plot' that contains the data to be used for coloring the map
                    color='Number of approved policies',
                    # Set the column in 'df_plot' that represents the animation frame (i.e., policy changes over time)
                    animation_frame='Time',
                    # Set the color scale for the choropleth map using the 'Inferno' color map
                    color_continuous_scale='Inferno',
                    # Set the color range for the map (minimum and maximum values)
                    range_color=[0, 3],
                    # Set the scope of the map to 'usa' to focus on the United States
                    scope='usa',
                    # Set the title of the choropleth map
                    title='Changes In Opioid Treatment Policy During COVID-19',
                    # Customize the labels for the color scale and axis
                    labels={'Number of approved policies': 'Approved Policies'}
)

# Show the plot
fig.show()

From this animation, we can see that California, Colorado, and Pennsylvania had the most policy changes enacted in this timeframe. We see that Maryland was the first to enact a policy change after the pandemic started. 29 states did not enact any policy changes during the pandemic, while 21 states enacted at least one change during this time.
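Counts like these can also be read straight from df rather than from the animation. A sketch of the idea on a made-up mini-frame (since the real df lives in the notebook):

```python
import pandas as pd

# Hypothetical mini-frame with the same columns as df
toy = pd.DataFrame({
    'Abbreviations': ['MD', 'CA', 'CA', 'TX'],
    'Change In Policy': ['Approved', 'Approved', 'Approved', 'Not Approved'],
})

# Number of jurisdictions that approved at least one policy
n_any = toy.loc[toy['Change In Policy'] == 'Approved', 'Abbreviations'].nunique()
print(n_any)  # 2
```

Applying the same expression to the notebook's df reproduces the counts summarized above, and sorting the approved rows by 'Effective Date' identifies the earliest adopter.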

In [ ]: